library(lessR)
library(ggplot2)
data <- read.csv(file="NutritionStudy.csv",head=TRUE,sep=",", stringsAsFactors = TRUE)
1) Consider the continuous variable, FIBER. Is this variable correlated with Cholesterol? Obtain a scatterplot and appropriate statistics to address this question.

The scatter plot does show a certain degree of correlation. Looking at the lowest dots across the X axis, I do make out a slight upward trend up through the Fiber <30. The same can be said for the overall trend of the scatter plots. However, the upward direction does seem to tail and taper off at Fiber = 30. There aren’t that many points at Fiber = 30 anyways, so potentially we can ignore that diminishing upward trend.

Generating the correlation coefficient, you get a value of .15, which is low, but a positive correlation nonetheless.

ggplot(data = data, aes(x= Fiber, y=Cholesterol)) + geom_point() + theme_classic() + geom_smooth(method = 'loess') + labs(title = "Fiber vs Cholestorol")
## `geom_smooth()` using formula 'y ~ x'

cor(data$Cholesterol,data$Fiber)
## [1] 0.1539684
2) Fit a simple linear regression model that uses FIBER to predict CHOLESTEROL(Y). Report the model, interpret the coefficients, discuss the goodness of fit.

The Equation for this model is \(Y=193.701+3.813X_{1}\).
The interpretation is that if someone doesn’t eat fiber, their cholestorol will be 193.701.
For every unit of fiber they eat, their cholestorol will increase by 3.813.

The \(R^2\) is .02, which means it only explains 2.0% of the variance. This indicates that the fiber variable is not enough to predict cholestorol. The variable is significant though, at a 5% cutoff, so we reject the null hyptohesis that the slope is 0. So any relationship we establish between these 2 variables can be real (and not due to chance).
This may be a variable I would keep throughout the analysis

The F test:
\(H_{0}: \beta_{1} = 0\)
\(H_{a}: \beta_{i} \neq 0\)
Looking at the ANOVA and the F statistic, we can reject the null hypotheis that \(\beta = 0\).

While the \(R^2\) is low because it doesnt explain all the variance, the hypothesis tests is telling us that the variable does have predictive power and that the slope not equal to 0 is not due to chance.

Checking assumptions:
Linearity - we do see more or less a linear pattern, until the < 30 fiber counts. But we dont see any non-linear patterns like a polynomial, exponential, etc etc.
Homoscedasticity - The residuals vs fitted graph dont appear to be completly constant. Towards the higher end of the fitted values, we dont see the same distribution around the x=0 as the other values.
residual normality - the QQ plot does appear to show that the residuals are not perfectly normally distributed towards the top ends. multicollinearity - Since we are only using 1 predictor, this does not apply.
The scale location plot tells the same story as the residual plot, but something interesting is that looking at the red line, the magnitude of the residuals are fairly flat across all X’s.
levearage: We dont see any outliers as all the points are within the cooks distance line.

model1 <- lm(Cholesterol~Fiber, data=data)
summary(model1)
## 
## Call:
## lm(formula = Cholesterol ~ Fiber, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -216.48  -88.58  -34.54   61.18  652.10 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)
## (Intercept)  193.701     19.157  10.111 < 0.0000000000000002
## Fiber          3.813      1.383   2.757              0.00618
## 
## Residual standard error: 130.6 on 313 degrees of freedom
## Multiple R-squared:  0.02371,    Adjusted R-squared:  0.02059 
## F-statistic:   7.6 on 1 and 313 DF,  p-value: 0.006179
plot(model1)

anova(model1)
## Analysis of Variance Table
## 
## Response: Cholesterol
##            Df  Sum Sq Mean Sq F value   Pr(>F)
## Fiber       1  129684  129684  7.6002 0.006179
## Residuals 313 5340757   17063
3) For the ALCOHOL categorical variable, create a set of dummy coded (0/1) indicator variables. Fit a multiple linear model that uses the FIBER continuous variable and the ALCOHOL dummy coded variables to predict the response variable Y=CHOLESTEROL. Remember to leave one of the dummy coded variables out of the model so that you have a basis of interpretation for the constant term. Report the model, interpret the coefficients, discuss hypothesis test results, goodness of fit statistics, diagnostic graphs, and leverage, influence and Outlier statistics. This is called an Analysis of Covariance Model (ANCOVA)

The Equation for this model is \(Y=189.266+3.813X_{1}-2.523X_{2}+44.429X_{3}\).
The interpretation is as follows:
when all betas are equal to 0 (no influence in any of the variables or if the value is -), the cholestorol is 189.266.

Fiber Variable:
For alcohol = 0 as fiber increases 1 unit, cholestorol increases 3.813 units.

Alcohol Variable:
using 0 alcohol consumption as the basis for comparison, for someone who drinks >0 alcohol and <10 units, their total cholestorol would be the same as someone who drinks 0 alcohol, but you would then decrease by -2.523.

using 0 alcohol consumption as the basis for comparison, for someone who drink >10 units, their total cholestorol would be the same as someone who drinks 0 alcohol, but you would then increasee it by +44.429

The adjusted \(R^2\) is .02363, which means it only explains 2.4% of the variance. The p value for fiber is significant while the p value for alcohol is insignificant.

The F test:
\(H_{0}: \beta_{1} = \beta_{2} = \beta_{3}= 0\)
\(H_{a}:\) at least one \(\beta_{i} \neq 0\)
Looking at the ANOVA and also looking at the F statistic in the summary table, we see p-value at .015 and we do see signficant Pr(>F). With this, we can reject the null hypothesis that \(H_{0}\) that all \(\beta\) = 0.

While the \(R^2\) is low because it doesnt explain all the variance, the hypothesis tests is telling us that the variable does have predictive power and that the slope not equal to 0 is not due to chance.

Checking assumptions:

Homoscedasticity - The residuals vs fitted graph dont appear to be completly constant. Towards the higher end of the fitted values, we dont see the same distribution around the x=0 as the other values. (Same as the prior chart with only fiber as a variable)
residual normality - the QQ plot does appear to show that the residuals are not perfectly normally distributed towards the top ends.
multicollinearity - Plotting a simple scatter of fiber vs alcohol, we do not see obvious signs of multicollinearity.
The scale location plot tells the same story as the residual plot, but something interesting is that looking at the red line, the magnitude of the residuals are fairly flat across all X’s.
levearage: We dont see any outliers as all the points are within the cooks distance line.

data$Alcohol_d0 <- ifelse(data$Alcohol==0,1,0)
data$Alcohol_d1 <- ifelse(data$Alcohol<10 & data$Alcohol>0,1,0)
data$Alcohol_d2 <- ifelse(data$Alcohol>=10,1,0)

model2 <- lm(Cholesterol~Fiber + Alcohol_d1 + Alcohol_d2, data=data)

summary(model2)
## 
## Call:
## lm(formula = Cholesterol ~ Fiber + Alcohol_d1 + Alcohol_d2, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -218.31  -91.83  -32.24   64.65  654.06 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)
## (Intercept)  189.266     21.065   8.985 < 0.0000000000000002
## Fiber          3.984      1.389   2.868              0.00441
## Alcohol_d1    -2.523     15.836  -0.159              0.87352
## Alcohol_d2    44.429     28.429   1.563              0.11912
## 
## Residual standard error: 130.4 on 311 degrees of freedom
## Multiple R-squared:  0.03296,    Adjusted R-squared:  0.02363 
## F-statistic: 3.533 on 3 and 311 DF,  p-value: 0.01518
anova(model2)
## Analysis of Variance Table
## 
## Response: Cholesterol
##             Df  Sum Sq Mean Sq F value   Pr(>F)
## Fiber        1  129684  129684  7.6239 0.006102
## Alcohol_d1   1    9065    9065  0.5329 0.465927
## Alcohol_d2   1   41545   41545  2.4423 0.119117
## Residuals  311 5290147   17010
plot(model2)

plot(data$Fiber,data$Alcohol)

4) Use the ANCOVA model from task 3) to obtain predicted values for CHOLESTEROL(Y). Now, make a scatterplot of the Predicted Values for Y (y-axis) by FIBER (X), but color code the records for the different groups of ALCOHOL. What do you notice about the patterns in the predicted values of Y? Now, make a scatterplot of the actual values of CHOLESTEROL(Y) by FIBER (X), but color code by the different groups of the ALCOHOL variable. If you compare the two scatterplots, does the ANCOVA model appear to fit the observed data very well? Or, is a more complex model needed?

Plotting a scatter plot of fitted versus fiber and grouped by alcoholgroup, we see a straight line for each alcohol group. The lines are all parallel to each other (with the same slope).
Now, when we plot the actual versus fiber, and color code based on alcohol group, we see a much more complex and interwined scatter plot.
This is telling me that the fitted model did not fit the observed data very well. The fitted seemed to oversimplfy the relationship beteween fiber and cholestorol. A more complex model may be needed.

alcoholdata <-data
alcoholdata$fitted = model2$fitted.values
alcoholdata$alcoholgroup <- ifelse(alcoholdata$Alcohol_d0==1,1,ifelse(alcoholdata$Alcohol_d1==1,2,3))
ggplot(data=alcoholdata, aes(x=Fiber, 
                      y=fitted, 
                      color = factor(alcoholgroup)))+geom_point() + labs(title = "fiber vs cholestorol by group")

#+ geom_smooth(method=lm, aes(fill=factor(alcoholgroup)))

Plot(Fiber,Cholesterol, by = alcoholgroup,  fit = TRUE, data=alcoholdata)

## >>> Suggestions
## Plot(Fiber, Cholesterol, out_cut=.10)  # label top 10% potential outliers
## Plot(Fiber, Cholesterol, ellipse=0.95, add="means")  # 0.95 ellipse with means
## Plot(Fiber, Cholesterol, enhance=TRUE)  # many options, including the above
## Plot(Fiber, Cholesterol, shape="diamond")  # change plot character 
## 
## 
## >>> Pearson's product-moment correlation 
##  
## Number of paired values with neither missing, n = 315 
## 
## 
## Sample Correlation of Fiber and Cholesterol: r = 0.154 
## 
## 
## Alternative Hypothesis: True correlation is not equal to 0 
##   t-value: 2.757,  df: 313,  p-value: 0.006 
##  
## 95% Confidence Interval of Population Correlation 
##   Lower Bound: 0.044      Upper Bound: 0.260
5) Create new interaction variables by multiplying the dummy coded variables for ALCOHOL by the continuous FIBER(X) variable. Save these product variables to your dataset. Now, to build the model, start with variables in your ANCOVA model from task 4) and add the interaction variables you just created into the multiple regression model. Don’t forget, there is one category that is the basis of interpretation. DO NOT include any interaction term that is associated with that category. This is called an Unequal Slopes Model. Fit this model, and save the predicted values. Plot the predicted values for CHOLESTEROL (Y) by FIBER(X). Discuss what you see in this graph. In addition, report the model, interpret the coefficients, discuss hypothesis test results, goodness of fit statistics, diagnostic graphs, and leverage, influence and Outlier statistics.

The Equation for this model is \(Y=230.3434+.6363X_{1}-62.8481X_{2}-63.3814X_{3}+4.7976X_{5} + 9.0742X_{6}\).

The interpretation is as follows:
when all betas are equal to 0 (no influence in any of the variables), the cholestorol is 230.3434

Fiber Variable:
For alcohol = 0 as fiber increases 1 unit, cholestorol increases .6363 units.

Alcohol Variable:
using 0 alcohol consumption as the basis for comparison, for someone who drinks >0 alcohol and <10 units, their total cholestorol would be the same as someone who drinks 0 alcohol, but you would then decrease by -62.848

using 0 alcohol consumption as the basis for comparison, for someone who drink >10 units, their total cholestorol would be the same as someone who drinks 0 alcohol, but you would then decrease it by 63.3814

Interacton Variables: Looking at the interaction of <10 alcohol and fiber, the coefficient of 4.7976 tells us that the original slope of .6363 is now (4.7976 + .6363) = 5.4339.

Looking at the interaction of >10 alcohol and fiber, the coefficient of 9.0742 is telling us that the the original slope of .6363 is now (.6363 + 9.0742) = 9.7105.

This gives us the impression that higher alcohol volume combined with fiber, yields a higher rate of change in cholestorol. The adjusted \(R^2\) is .028, which means it only explains 2.8% of the variance. The p value for fiber is significant while the p value for alcohol and it’s interactions is insignificant.

The F test:
\(H_{0}: \beta_{1} = \beta_{2} = \beta_{3}= \beta_{4} = \beta_{5} =0\)
\(H_{a}:\) at least one \(\beta_{i} \neq 0\)
Looking at the ANOVA and the F statistic and also looking at the F statistic in the summary table, we see p-value at .015 and we do see some signficant Pr(>F). With this, we can reeject the null hypothesis that \(H_{0}\) that all \(\beta\) = 0.

Checking assumptions:

Homoscedasticity - The residuals vs fitted graph dont appear to be completly constant. Towards the higher end of the fitted values, we dont see the same distribution around the x=0 as the other values. (Same as the prior chart with only fiber as a variable)
residual normality - the QQ plot does appear to show that the residuals are not perfectly normally distributed towards the top ends. multicollinearity - Plotting a simple scatter of fiber vs alcohol, we do not see obvious signs of multicollinearity.
The scale location plot tells the same story as the residual plot, but something interesting is that looking at the red line, the magnitude of the residuals are fairly flat across all X’s.
levearage: We dont see any outliers as all the points are within the cooks distance line.

alcoholdata$fxd0 <- alcoholdata$Fiber*alcoholdata$Alcohol_d0
alcoholdata$fxd1 <- alcoholdata$Fiber*alcoholdata$Alcohol_d1
alcoholdata$fxd2 <- alcoholdata$Fiber*alcoholdata$Alcohol_d2

model3 <- lm(Cholesterol~Fiber + Alcohol_d1 + Alcohol_d2 + fxd1 + fxd2, data=alcoholdata)


alcoholdata$unequalslopefitted <- model3$fitted.values
summary(model3)
## 
## Call:
## lm(formula = Cholesterol ~ Fiber + Alcohol_d1 + Alcohol_d2 + 
##     fxd1 + fxd2, data = alcoholdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -184.25  -88.39  -25.85   64.40  661.19 
## 
## Coefficients:
##             Estimate Std. Error t value         Pr(>|t|)
## (Intercept) 230.3434    31.5413   7.303 0.00000000000241
## Fiber         0.6363     2.3655   0.269            0.788
## Alcohol_d1  -62.8481    40.5528  -1.550            0.122
## Alcohol_d2  -63.3814    85.4549  -0.742            0.459
## fxd1          4.7976     2.9565   1.623            0.106
## fxd2          9.0742     6.8735   1.320            0.188
## 
## Residual standard error: 130.1 on 309 degrees of freedom
## Multiple R-squared:  0.04366,    Adjusted R-squared:  0.02819 
## F-statistic: 2.821 on 5 and 309 DF,  p-value: 0.01651
anova(model3)
## Analysis of Variance Table
## 
## Response: Cholesterol
##             Df  Sum Sq Mean Sq F value   Pr(>F)
## Fiber        1  129684  129684  7.6597 0.005987
## Alcohol_d1   1    9065    9065  0.5354 0.464888
## Alcohol_d2   1   41545   41545  2.4538 0.118265
## fxd1         1   29048   29048  1.7157 0.191221
## fxd2         1   29508   29508  1.7429 0.187754
## Residuals  309 5231592   16931
plot(model3)

ggplot(data=alcoholdata, aes(x=Fiber, 
                      y=unequalslopefitted, 
                      color = factor(alcoholgroup)))+geom_point() + labs(title = "fiber vs cholestorol with interaction")

6) You should be aware that the models of Task 4) and Task 5) are nested. Which model is the full and which one is the reduced model? Write out the null and alternative hypotheses for the nested F-test in this situation to determine if the slopes are unequal. Use the ANOVA tables from those two models you fit previously to compute the F-statistic for a nested F-test using Full and Reduced models. Conduct and interpret the nested hypothesis test. Are there unequal slopes? Discuss the findings.

The full model is the model with the interactions and the reduced model is the model without the interactions.

Nested F test:
\(H_{0}: \beta_{4} = \beta_{5}=0\)
\(H_{a}: \beta_4 \neq 0\) or \(\beta_5 \neq 0\)

The nested F test tells us that the betas for the interactions are not signficant, so we can’t reject the null hypthesis that the slope is 0. This means that we can’t definitively say that there is a relationship between the interactions to cholestorol. And since we arent rejecting the null hypothesis, we can’t say that the slopes are different.

anova(model2, model3)
## Analysis of Variance Table
## 
## Model 1: Cholesterol ~ Fiber + Alcohol_d1 + Alcohol_d2
## Model 2: Cholesterol ~ Fiber + Alcohol_d1 + Alcohol_d2 + fxd1 + fxd2
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1    311 5290147                           
## 2    309 5231592  2     58556 1.7293 0.1791
7) Now that you’ve been exposed to these modeling techniques, it is time for you to use them in practice. Let’s examine more of the NutritionStudy data. Use the above practiced techniques to determine if SMOKE, VITAMINS, or GENDER interacts with the FIBER variable and influences the amount of CHOLESTEROL. Formulate hypotheses, construct essential variables (as necessary), conduct the analysis and report on the results. Which categorical variables are most predictive of CHOLESTEROL, in conjunction with FIBER.

creating dummies for smoke, vitamins, and gender

data$Smoke_d0 <- ifelse(data$Smoke=="No",1,0)
data$Smoke_d1 <- ifelse(data$Smoke=="Yes",1,0)

data$VitaminUse_d0 <- ifelse(data$VitaminUse=="No",1,0)
data$VitaminUse_d1 <- ifelse(data$VitaminUse=="Occasional",1,0)
data$VitaminUse_d2 <- ifelse(data$VitaminUse=="Regular",1,0)

data$Gender_d0 <- ifelse(data$Gender=="Female",1,0)
data$Gender_d1 <- ifelse(data$Gender=="Male",1,0)

#Interactions
data$fxsmoked0 <- data$Fiber*data$Smoke_d0
data$fxsmoked1 <- data$Fiber*data$Smoke_d1

data$fxVitaminUsed0 <- data$Fiber*data$VitaminUse_d0
data$fxVitaminUsed1 <- data$Fiber*data$VitaminUse_d1
data$fxVitaminUsed2 <- data$Fiber*data$VitaminUse_d2

data$fxGenderd0 <- data$Fiber*data$Gender_d0
data$fxGenderd1 <- data$Fiber*data$Gender_d1

Smoke and Fiber.

smokedata <- data
smokedata$smokegroup <- ifelse(smokedata$Smoke_d0==1,1,2)

The Equation for this model is as follows:
Full Model: \(Y=179.184+4.455_{1}+63.059X_{2}-1.597X_{3}\).
Reduced Model: \(181.274 +4.296X_{1}+45.738X_2\).

Interpretation:
For the Full model, when all betas are 0, the cholestorol is 179.184.
When smoke = ‘No’, then cholestorol will be increased by 4.455 for every unit of fiber increase
When smoke = ‘Yes’, cholestorol will be the same as when smoke = ‘No’, but 63.059 will be added to the total cholestorol.
With the interaction of fiber and smoke, for smoke = ‘Yes’, instead of 4.455 for every unit increase, the unit increase will be (4.455-1.597) = 2.858.

The \(R^2\) for both models are low, (2.8 and 3.1 for full and reduced models), so they do not explain all the variance. The p-values are still insignficant except for fiber, so we dont reject the null hypothesis that the betas are equal to 0. This also means that the relationship we see between smoke and cholestorol may be due to chance.

smoke_full <-lm(Cholesterol~Fiber + Smoke_d1 + fxsmoked1, data=smokedata)
smokedata$fitted.full <- smoke_full$fitted.values
summary(smoke_full)
## 
## Call:
## lm(formula = Cholesterol ~ Fiber + Smoke_d1 + fxsmoked1, data = smokedata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -218.86  -87.71  -35.15   65.11  657.36 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)
## (Intercept)  179.184     20.875   8.583 0.000000000000000447
## Fiber          4.455      1.471   3.028              0.00267
## Smoke_d1      63.059     55.002   1.146              0.25248
## fxsmoked1     -1.597      4.661  -0.343              0.73218
## 
## Residual standard error: 130.1 on 311 degrees of freedom
## Multiple R-squared:  0.03789,    Adjusted R-squared:  0.02861 
## F-statistic: 4.082 on 3 and 311 DF,  p-value: 0.007277
smoke_reduced <-lm(Cholesterol~Fiber + Smoke_d1, data=smokedata)
smokedata$fitted.reduced <- smoke_reduced$fitted.values
summary(smoke_reduced)
## 
## Call:
## lm(formula = Cholesterol ~ Fiber + Smoke_d1, data = smokedata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -216.77  -87.79  -35.81   65.54  657.56 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)
## (Intercept)  181.274     19.936   9.093 < 0.0000000000000002
## Fiber          4.296      1.394   3.081              0.00224
## Smoke_d1      45.738     21.611   2.116              0.03510
## 
## Residual standard error: 129.9 on 312 degrees of freedom
## Multiple R-squared:  0.03752,    Adjusted R-squared:  0.03135 
## F-statistic: 6.082 on 2 and 312 DF,  p-value: 0.002563

F tests an Nested F test:
\(H_{0}: \beta_{1} = \beta_{2} = \beta_{3} =0\)
\(H_{a}:\) at least one \(\beta_{i} \neq 0\)
Our F test for both full and reduced model is telling us that again, fiber is significant, and smoke is significant at 3.5%

For the Nested F test:
\(H_{0}: \beta_{3}=0\)
\(H_{a}: \beta_3 \neq 0\)

We are not able to reject the null hypothesis with the F statistic.

Using all the above info, it appears that 1) Fiber is a significant variable, 2) The F tests is telling us we are able to reject the null hypothesis for smoke (excluding the interaction), and 3)the interaction term is also not significant, so we can potentially omit this interaction variable in the model. This is also telling us that the unequal slopes may not be real.

anova(smoke_full)
## Analysis of Variance Table
## 
## Response: Cholesterol
##            Df  Sum Sq Mean Sq F value   Pr(>F)
## Fiber       1  129684  129684  7.6630 0.005975
## Smoke_d1    1   75590   75590  4.4666 0.035360
## fxsmoked1   1    1986    1986  0.1173 0.732179
## Residuals 311 5263182   16923
anova(smoke_reduced)
## Analysis of Variance Table
## 
## Response: Cholesterol
##            Df  Sum Sq Mean Sq F value   Pr(>F)
## Fiber       1  129684  129684  7.6847 0.005904
## Smoke_d1    1   75590   75590  4.4793 0.035100
## Residuals 312 5265167   16876
anova(smoke_full, smoke_reduced)
## Analysis of Variance Table
## 
## Model 1: Cholesterol ~ Fiber + Smoke_d1 + fxsmoked1
## Model 2: Cholesterol ~ Fiber + Smoke_d1
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1    311 5263182                           
## 2    312 5265167 -1   -1985.6 0.1173 0.7322

Checking assumptions:

Homoscedasticity - The residuals do look random, but it’s not constant throughout all of the fitted values. towards the higher end of the fitted values, it is not longer spread around the 0 mark.
residual normality - the QQ plot does appear to show that the residuals are not perfectly normally distributed towards the top ends.
The scale location plot tells the same story as the residual plot: random residuals, but the standarized residuals seem to increase just a tad towards the higher ends of the fitted values
levearage: We dont see any outliers as all the points are within the cooks distance line.

In terms of goodness of it, I feel this is not a perfect dataset that satisifies all linear regression assumptions, but it’s not terrible. Any issues with the prediction result would be more caused by the variable selection than assumption violations.

plot(data$Fiber,data$Smoke)

ggplot(data=smokedata, aes(x=Fiber, 
                      y=fitted.reduced, 
                      color = factor(smokegroup)))+geom_point() + labs(title = "smoke and fiber")

ggplot(data=smokedata, aes(x=Fiber, 
                      y=fitted.full, 
                      color = factor(smokegroup)))+geom_point() + labs(title = "smoke and fiber with interactions")

plot(smoke_full)

plot(smoke_reduced)

fiber and vitamins

vitamindata <- data
vitamindata$vitamingroup <- ifelse(vitamindata$VitaminUse_d0==1,1,ifelse(vitamindata$VitaminUse_d1==1,2,3))

The Equation for this model is as follows:
Full Model: \(Y=208.821+3.111X_1 -19.453X_2 -29.942X_3 + 1.300X_4 + 1.1196X_5\).
Reduced Model: \(198.745 + 3.940X_1 -3.401X_2 - 14.947X_3\).

Interpretation:
For the Full model, when all betas are 0, the cholestorol is 208.821
When vitaminuse = ‘No’, then cholestorol will be increased by 3.111 for every unit of fiber increase.
When vitaminuse = ‘occasional’, cholestorol will be the same as when vitaminUse = ‘No’, but -19.453 will be added to the total cholestorol.
When vitaminuse = ‘regular’, cholestorol will be the same as when vitaminUse = ‘No’, but -29.942 will be added to the total cholestorol.

The Reduced model is interpreted similary where we just shift the fiber coefficient by the vitamin coefficient, minus the interaction components

Interaction terms With the interaction of fiber and VitaminUse, for vitaminUse = ‘Occasional’, instead of 3.111 for every unit increase, the unit increase will be (3.111+1.300) = 4.411

With the interaction of fiber and vitumanUse, for vitaminUse = ‘Regular’, instead of 3.111 for every unit increase, the unit increase will be (3.111+1.1196) = 4.307.

The is all assuming basis of interpretation for vitaminUse= “No”

The \(R^2\) for both models are low, (.01 and .016 for full and reduced models), so they do not explain all the variance. The p-values are still insignficant except for fiber, so we dont reject the null hypothesis that the betas are equal to 0. This also means that the relationship we see between vitamin and cholestorol may be due to chance.

vitamin_full <-lm(Cholesterol~Fiber + VitaminUse_d1 + VitaminUse_d2 + fxVitaminUsed1 + fxVitaminUsed2, data=vitamindata)
vitamindata$fitted.full <- vitamin_full$fitted.values
summary(vitamin_full)
## 
## Call:
## lm(formula = Cholesterol ~ Fiber + VitaminUse_d1 + VitaminUse_d2 + 
##     fxVitaminUsed1 + fxVitaminUsed2, data = vitamindata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -214.64  -91.71  -33.55   63.36  659.80 
## 
## Coefficients:
##                Estimate Std. Error t value       Pr(>|t|)
## (Intercept)     208.821     32.308   6.463 0.000000000399
## Fiber             3.111      2.454   1.267          0.206
## VitaminUse_d1   -19.453     52.883  -0.368          0.713
## VitaminUse_d2   -29.942     43.947  -0.681          0.496
## fxVitaminUsed1    1.300      3.945   0.329          0.742
## fxVitaminUsed2    1.196      3.188   0.375          0.708
## 
## Residual standard error: 131.3 on 309 degrees of freedom
## Multiple R-squared:  0.02681,    Adjusted R-squared:  0.01106 
## F-statistic: 1.702 on 5 and 309 DF,  p-value: 0.1338
vitamin_reduced <-lm(Cholesterol~Fiber + VitaminUse_d1+VitaminUse_d2, data=vitamindata)
vitamindata$fitted.reduced <- vitamin_reduced$fitted.values
summary(vitamin_reduced)
## 
## Call:
## lm(formula = Cholesterol ~ Fiber + VitaminUse_d1 + VitaminUse_d2, 
##     data = vitamindata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -209.93  -88.01  -35.04   62.02  660.16 
## 
## Coefficients:
##               Estimate Std. Error t value             Pr(>|t|)
## (Intercept)    198.745     20.990   9.469 < 0.0000000000000002
## Fiber            3.940      1.393   2.828              0.00498
## VitaminUse_d1   -3.401     19.074  -0.178              0.85861
## VitaminUse_d2  -14.947     17.259  -0.866              0.38714
## 
## Residual standard error: 130.9 on 311 degrees of freedom
## Multiple R-squared:  0.02627,    Adjusted R-squared:  0.01688 
## F-statistic: 2.797 on 3 and 311 DF,  p-value: 0.04034

F tests an Nested F test:
\(H_{0}: \beta_{1} = \beta_{2} = \beta_{3} = \beta_4 = \beta_5 = 0\)
\(H_{a}:\) at least one \(\beta_{i} \neq 0\)
Our F test for both full and reduced model is telling us that again, fiber is significant, but the remaing variables are not.

For the Nested F test:
\(H_{0}: \beta_{3} = \beta_4=0\) \(H_{a}: \beta_i \neq 0\)

We are not able to reject the null hypothesis with the F statistic.

Using all the above info, it appears that 1) Fiber is a significant variable, 2) The F tests is telling us we are not able to reject the null hypothesis, so it’s possible that any relationship generated by this model we see to cholestorol is due to chance, and 3)the interaction term is also not significant, so we can potentially omit this variable in the model. This is also telling us that the unequal slopes may not be real.

anova(vitamin_full)
## Analysis of Variance Table
## 
## Response: Cholesterol
##                 Df  Sum Sq Mean Sq F value   Pr(>F)
## Fiber            1  129684  129684  7.5270 0.006433
## VitaminUse_d1    1    1181    1181  0.0686 0.793631
## VitaminUse_d2    1   12846   12846  0.7456 0.388547
## fxVitaminUsed1   1     501     501  0.0291 0.864663
## fxVitaminUsed2   1    2425    2425  0.1407 0.707808
## Residuals      309 5323804   17229
anova(vitamin_reduced)
## Analysis of Variance Table
## 
## Response: Cholesterol
##                Df  Sum Sq Mean Sq F value   Pr(>F)
## Fiber           1  129684  129684  7.5716 0.006278
## VitaminUse_d1   1    1181    1181  0.0690 0.793034
## VitaminUse_d2   1   12846   12846  0.7500 0.387144
## Residuals     311 5326730   17128
anova(vitamin_full,vitamin_reduced)
## Analysis of Variance Table
## 
## Model 1: Cholesterol ~ Fiber + VitaminUse_d1 + VitaminUse_d2 + fxVitaminUsed1 + 
##     fxVitaminUsed2
## Model 2: Cholesterol ~ Fiber + VitaminUse_d1 + VitaminUse_d2
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1    309 5323804                           
## 2    311 5326730 -2   -2926.1 0.0849 0.9186

Checking assumptions:

Homoscedasticity - The residuals do look random, but it’s not constant throughout all of the fitted values. towards the higher end of the fitted values, it is not longer spread around the 0 mark.
residual normality - the QQ plot does appear to show that the residuals are not perfectly normally distributed towards the top ends. The scale location plot tells the same story as the residual plot: random residuals, but the standarized residuals seem to increase just a tad towards the higher ends of the fitted values levearage: We dont see any outliers as all the points are within the cooks distance line.

In terms of goodness of it, I feel this is not a perfect dataset that satisifies all linear regression assumptions, but it’s not terrible. Any issues with the prediction result would be more caused by the variable selection than assumption violations.

plot(vitamin_full)

ggplot(data=vitamindata, aes(x=Fiber, 
                      y=fitted.reduced, 
                      color = factor(vitamingroup)))+geom_point() + labs(title = "vitamins and fiber")

ggplot(data=vitamindata, aes(x=Fiber, 
                      y=fitted.full, 
                      color = factor(vitamingroup)))+geom_point() + labs(title = "vitamins and fiber with interactions")

Fiber and Gender

Genderdata <- data
Genderdata$Gendergroup <- ifelse(Genderdata$Gender_d0==0,1,2)

fiber and gender
The Equation for this model is as follows:

Full Model: \(Y=162.359+5.273X_1 +311.514X_2 -16.138X_3\).
Reduced Model: \(184.490 + 3.529X_1 +96.294X_2\).

Interpretation:
For the Full model, when all betas are 0, the cholestorol is 162.359
When Gender = ‘Female’, then cholestorol will be increased by 5.273 for every unit of fiber increase. When Gender = ‘Male’, cholestorol will be the same as when Gender = ‘Female’, but 311.514 will be added to the total cholestorol.

The Reduced model is interpreted similary where we just shift the fiber coefficient by the gender coefficient, minus the interaction components

Interaction terms With the interaction of fiber and gender, for gender = ‘Male’, instead of 5.273 for every unit increase, the unit increase will be (5.273-16.138)= -10.865

The is all assuming basis of interpretation for Gender = “Female”

The \(R^2\) for both models are low, (.11 and .07 for full and reduced models), so they do not explain all the variance. The p-values are still insignficant except for fiber, so we dont reject the null hypothesis that the betas are equal to 0. This also means that the relationship we see between vitamin and cholestorol may be due to chance.

Gender_full <-lm(Cholesterol~Fiber + Gender_d1 + fxGenderd1, data=Genderdata)
Genderdata$fitted.full <- Gender_full$fitted.values
summary(Gender_full)
## 
## Call:
## lm(formula = Cholesterol ~ Fiber + Gender_d1 + fxGenderd1, data = Genderdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -299.55  -80.27  -25.28   53.23  662.41 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)
## (Intercept)  162.359     19.188   8.462 0.00000000000000105
## Fiber          5.273      1.391   3.790            0.000181
## Gender_d1    311.514     60.083   5.185 0.00000039029294889
## fxGenderd1   -16.138      4.233  -3.812            0.000166
## 
## Residual standard error: 124 on 311 degrees of freedom
## Multiple R-squared:  0.1261, Adjusted R-squared:  0.1177 
## F-statistic: 14.96 on 3 and 311 DF,  p-value: 0.000000004028
Gender_reduced <-lm(Cholesterol~Fiber + Gender_d1, data=Genderdata)
Genderdata$fitted.reduced <- Gender_reduced$fitted.values
summary(Gender_reduced)
## 
## Call:
## lm(formula = Cholesterol ~ Fiber + Gender_d1, data = Genderdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -296.10  -85.19  -30.31   56.56  665.39 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)
## (Intercept)  184.490     18.681   9.876 < 0.0000000000000002
## Fiber          3.529      1.342   2.629              0.00898
## Gender_d1     96.294     21.013   4.583           0.00000665
## 
## Residual standard error: 126.6 on 312 degrees of freedom
## Multiple R-squared:  0.08527,    Adjusted R-squared:  0.07941 
## F-statistic: 14.54 on 2 and 312 DF,  p-value: 0.0000009149

F tests an Nested F test:
\(H_{0}: \beta_{1} = \beta_{2} = \beta_{3} = 0\)
\(H_{a}:\)at least one \(\beta_{i} \neq 0\)
Our F test for both full and reduced model is telling us that again, fiber is significant gender is also significant

For the Nested F test:
\(H_{0}: \beta_{2} = \beta_3=0\)
\(H_{a}: \beta_2 \neq 0\) or \(\beta_3\)

We are able to reject the null hypothesis with the F statistic.

Using all the above info, it appears that 1) Fiber, gender, and the interaction terms is significant, 2) The F tests is telling us we are able to reject the null hypothesis, so we can say that gender is significant in this model, and also, the interaction is does truly yield unequal slopes.

anova(Gender_full, Gender_reduced)
## Analysis of Variance Table
## 
## Model 1: Cholesterol ~ Fiber + Gender_d1 + fxGenderd1
## Model 2: Cholesterol ~ Fiber + Gender_d1
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)
## 1    311 4780527                              
## 2    312 5003953 -1   -223427 14.535 0.0001659

Checking assumptions:

Homoscedasticity - The residuals do look random, but it’s not constant throughout all of the fitted values. towards the higher end of the fitted values, it is not longer spread around the 0 mark.
residual normality - the QQ plot does appear to show that the residuals are not perfectly normally distributed towards the top ends.
The scale location plot tells the same story as the residual plot: random residuals, but the standarized residuals seem to increase just a tad towards the higher ends of the fitted values
levearage: We dont see any outliers as all the points are within the cooks distance line.

In terms of goodness of it, I feel this is not a perfect dataset that satisifies all linear regression assumptions, but it’s not terrible. Any issues with the prediction result would be more caused by the variable selection than assumption violations.

plot(Gender_full)

ggplot(data=Genderdata, aes(x=Fiber, 
                      y=fitted.reduced, 
                      color = factor(Gendergroup)))+geom_point() + labs(title= "fiber  vs gender")

ggplot(data=Genderdata, aes(x=Fiber, 
                      y=fitted.full, 
                      color = factor(Gendergroup)))+geom_point()  + labs(title = "fiber vs gender with interactions")

Comparison of all variables
For alcohol: Without interactions, all alcohol groupings are insignificant. P value for F statistic is .015 \(R^2\) is 2.3.
With interactions, all alcohol groupings are insignificant. p value for F statistic is .016. \(R^2\) is 2.8.
Nested F test: we fail to reject the interaction variable.

For smoke
Without interactions, all variables are significant. \(R^2\) for smoke is .031. p value for F statistic is .0025.
With interactions, all smoke variables are not significant. \(R^2\) is .028. p value for F statistic is .007
Nested F test: we fail to reject the interaction variable.

for vitaminUse
Without interactions, all vitamin variables are not significant. \(R^2\) is 1.6%. P value for F statistic is .1338
With interactions, all vitamin variables are not significant. \(R^2\) is 1.6. P value for F statistic is .04
Nested F test: We fail to reject the interaction variable.

For Gender
Without interaction, all variables are significant. \(R^2\) is .079. P value for F statistic is very low.
With interaction, all variables are significant \(R^2\) is 11.8%. P value for the F statistic is extremely low.
Nested F test: the p value for the F statistic is very low.

Based on the 4 comparisons, I feel gender is the best predictor for fiber. It has the highest Correlation coefficient, so we know it explains the most variance. all the variables came out to be signficant, so we know that any relationship to fiber coming out of gender is probabilistically not due to chance. The F test which assesses the overall model fit is also significant, so we are reassured of the performance of the model. Finally, the nested F test tells us the interaction variable is a good predictor, so we can say that each gender group is different from each other.

8) Please write a reflection on your experiences.

I realized that for the prior assignments, I wasn’t methodlogical in assessing my models, so for this assignment and future ones for the rest of the term, I’ve decided to take more effort in addressing each piece of the assumptions. Because of this, I started to realize the “tediousness” of model validation.
One aspect of the assupmotions that I’m still not comfortable in is looking at the residual vs fitted plot. The first one of Fiber vs cholestorol made me ask myself a lot of questions: It looks random, but it’s not “constant” across all X’s. Does that disqualify “constant” variance? It’s not a perfect represenation of “constant” variance, but its also not a perfect represenation of something that violates constant variance (i,e “fanning” shape).
With the new tools we have addressing unequal slopes, I do see the benefit of using this knowledge, but I think the more tests we have, the more complexity and care we have to take in addressing each component. I feel like as we do more of these assignments, it may be good to have a “checklist” of things to look out for (i,e 1. address assumptions, 2. look at scatters, 3. look at interactions, 4. look at un equal slopes, 5. look at F tests etc etc).

Thinking about unequal slopes, I do think about how to use this in my workplace. The most difficult part is knowing which interactions to test (I do wonder, is there a scoring procedure to assess which interactions have the most difference in impact (before we finalize the selection)). Then, in terms of interpreting the unequal slopes, the visual is a good way to show how different the groups are.

9) Extra Credit: Feel free to explore models that have other continuous variables, as well as interactions of categorical variables. The more you do, the more extra credit you can accumulate.

Unfortunately, life got the best of me this week, so I dont have time to try other variables, but if given the time, I would probably try a couple things:
1. figure out some sort of loop to loop through a few more variables.
2. maybe try the interactions of 2 interactions?